library(readr)
library(dplyr)
library(lubridate)
library(ggplot2)
library(caret)
library(plotly)
library(randomForest)
library(nnet)
We will now focus our attention on seagrass, and aim to build a classification algorithm to predict presence or absence of any seagrass species in the Great Barrier Reef. We will use location, types of sediment, and types of seabed to predict the presence of four seagrass species: Cymodocea serrulata, Syringodium isoetifolium, Thalassia hemprichii, and Zostera muelleri (subspecies capricorni).
# read data
setwd("cleaned_data")
seagrass <- read.csv("seagrass_classification_data.csv", as.is =TRUE)
# factorize relevant variables
seagrass$SPECIES <- as.factor(seagrass$SPECIES)
seagrass$SEDIMENT <- as.factor(seagrass$SEDIMENT)
seagrass$TIDAL <- as.factor(seagrass$TIDAL)
# view data
head(seagrass)
## SPECIES LATITUDE LONGITUDE DEPTH SEDIMENT TIDAL
## 1 Z_CAPIRCOR -23.65857 151.1206 0 Mud intertidal
## 2 Z_CAPIRCOR -23.65840 151.1212 0 Mud intertidal
## 3 Z_CAPIRCOR -23.65898 151.1203 0 Mud intertidal
## 4 Z_CAPIRCOR -23.65970 151.1197 0 Mud intertidal
## 5 Z_CAPIRCOR -23.65884 151.1205 0 Mud intertidal
## 6 Z_CAPIRCOR -23.65993 151.1197 0 Mud intertidal
summary(seagrass)
## SPECIES LATITUDE LONGITUDE DEPTH
## C_SERRULAT: 1256 Min. :-24.20 Min. :142.5 Min. : 0.0000
## S_ISOETIFO: 101 1st Qu.:-23.79 1st Qu.:146.8 1st Qu.: 0.0000
## T_HEMPRICH: 823 Median :-22.41 Median :150.5 Median : 0.0000
## Z_CAPIRCOR:10201 Mean :-21.06 Mean :149.0 Mean : 0.9253
## 3rd Qu.:-19.17 3rd Qu.:151.3 3rd Qu.: 1.2279
## Max. :-10.75 Max. :151.9 Max. :29.3583
##
## SEDIMENT TIDAL
## Gravel: 1 intertidal:7433
## Mud :8969 subtidal :4948
## Reef : 63
## Rock : 66
## Rubble: 107
## Sand :3151
## Shell : 24
First we plot seagrass according to location (latitude and longitude). Then we will add a third axis (water depth) to visualize this in 3-dimensions using the plotly package. Since depth is measured in meters below sea level, we visualize this in negative values.
seagrass %>%
ggplot() +
geom_point(aes(x=LATITUDE, y=LONGITUDE, color=SPECIES)) +
ggtitle("Australia, Northeast Coast (Great Barrier Reef)")
plot_ly(seagrass, x=~LATITUDE, y=~LONGITUDE, z=~-DEPTH, color=~SPECIES, type="scatter3d", mode="markers")
We can also see how our categorical predictors, sediment and seabed type, relate to the number of observed seagrass species:
seagrass %>%
ggplot(aes(SEDIMENT, fill=SPECIES)) +
geom_bar(width=.5, position = "dodge")
seagrass %>%
ggplot(aes(TIDAL, fill=SPECIES)) +
geom_bar(width=.5, position = "dodge")
Our first attempt at the classifier (to predict presence of any seagrass species) will use random forest. We will first partition the data set into a training and testing set. Since we have over 12,000 observations, we can allocate 75% of the data to the training set.
# train-test split
seagrass_train_ind <- createDataPartition(y = seagrass$SPECIES, p=0.75, list=FALSE)
train_set <- seagrass[seagrass_train_ind, ]
test_set <- seagrass[-seagrass_train_ind, ]
# fit RF using relevant features, with mtry~p/3
rf_fit <- randomForest(SPECIES ~ LATITUDE + LONGITUDE + DEPTH + SEDIMENT + TIDAL,
data=train_set, mtry = 2)
# build vector of predictions
rf_pred <- predict(rf_fit, newdata = test_set, type = "response")
# view performance metrics
confusionMatrix(table(pred = rf_pred, true = test_set$SPECIES))
## Confusion Matrix and Statistics
##
## true
## pred C_SERRULAT S_ISOETIFO T_HEMPRICH Z_CAPIRCOR
## C_SERRULAT 232 11 10 24
## S_ISOETIFO 3 6 0 0
## T_HEMPRICH 5 5 193 3
## Z_CAPIRCOR 74 3 2 2523
##
## Overall Statistics
##
## Accuracy : 0.9548
## 95% CI : (0.9468, 0.9618)
## No Information Rate : 0.8242
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8458
##
## Mcnemar's Test P-Value : 4.663e-07
##
## Statistics by Class:
##
## Class: C_SERRULAT Class: S_ISOETIFO Class: T_HEMPRICH
## Sensitivity 0.73885 0.240000 0.94146
## Specificity 0.98381 0.999022 0.99550
## Pos Pred Value 0.83755 0.666667 0.93689
## Neg Pred Value 0.97089 0.993841 0.99584
## Prevalence 0.10149 0.008080 0.06626
## Detection Rate 0.07498 0.001939 0.06238
## Detection Prevalence 0.08953 0.002909 0.06658
## Balanced Accuracy 0.86133 0.619511 0.96848
## Class: Z_CAPIRCOR
## Sensitivity 0.9894
## Specificity 0.8548
## Pos Pred Value 0.9696
## Neg Pred Value 0.9451
## Prevalence 0.8242
## Detection Rate 0.8154
## Detection Prevalence 0.8410
## Balanced Accuracy 0.9221
We see that our classification model works quite well, especially for T_HEMPRICH and Z_CAPRICOR, which have 85%+ sensitivity and specificity. However, this model got quite a low sensitivity for S_ISOETIFO. Recall from our data wrangling process that S_ISOETIFO had only about 100 “Yes” observations. Since we thus had low variation in the values of S_ISOETIFO relative to the other three seagrass species, this may have contributed to the low sensitivity.
We can visually assess our predicted values with true values of species to see how our model performed.
# true values
plot_ly(test_set, x=~LATITUDE, y=~LONGITUDE, z=~-DEPTH, color=~SPECIES, type="scatter3d", mode="markers")
# predicted values
plot_ly(test_set, x=~LATITUDE, y=~LONGITUDE, z=~-DEPTH, color=~rf_pred, type="scatter3d", mode="markers")
These plots look fairly similar across the training and test sets, for all four species.
Below we see that across sediment types, the species predictions (rf_pred) appear very similar to the true species distribution (SPECIES). The same is true across seabed types, in the two bar graphs further below.
test_set %>%
ggplot(aes(SEDIMENT, fill=SPECIES)) +
geom_bar(width=.5, position = "dodge")
test_set %>%
ggplot(aes(SEDIMENT, fill=rf_pred)) +
geom_bar(width=.5, position = "dodge")
test_set %>%
ggplot(aes(TIDAL, fill=SPECIES)) +
geom_bar(width=.5, position = "dodge")
test_set %>%
ggplot(aes(TIDAL, fill=rf_pred)) +
geom_bar(width=.5, position = "dodge")
Finally, we examine variable importance using the Gini coefficient:
variable_importance <- importance(rf_fit)
tmp <- data_frame(feature = rownames(variable_importance),
Gini = variable_importance[,1]) %>%
arrange(desc(Gini))
tmp %>% ggplot(aes(x=reorder(feature, Gini), y=Gini)) +
geom_bar(stat='identity') +
coord_flip() + xlab("Feature") +
theme(axis.text=element_text(size=8))
We see that longitude and latitude were very predictive of presence of seagrass followed by water depth. The types of sediment and seabed are not very important predictors. Thus, it seems that the location of where the sea grass was discovered matters more than the various ocean floor properties.
We will now try a multinomial logistic regression model to see how it compares to the random forest we fit above. We will use the nnet package. The logistic regression model is as follows:
# fit model
multinom_fit <- multinom(SPECIES ~ LATITUDE + LONGITUDE + DEPTH + SEDIMENT, data=train_set)
## # weights: 44 (30 variable)
## initial value 12874.515732
## iter 10 value 3247.989443
## iter 20 value 2669.068404
## iter 30 value 2546.836953
## iter 40 value 2474.509681
## iter 50 value 2458.621441
## iter 60 value 2431.044865
## iter 70 value 2430.887870
## iter 80 value 2430.768859
## iter 90 value 2428.784786
## iter 100 value 2428.631883
## final value 2428.631883
## stopped after 100 iterations
summary(multinom_fit)
## Call:
## multinom(formula = SPECIES ~ LATITUDE + LONGITUDE + DEPTH + SEDIMENT,
## data = train_set)
##
## Coefficients:
## (Intercept) LATITUDE LONGITUDE DEPTH SEDIMENTMud SEDIMENTReef
## S_ISOETIFO -350.6681 2.2716446 3.014258 -0.03157577 -53.283832 -53.62843
## T_HEMPRICH -133.9094 1.8104662 1.290479 -0.30495007 -25.017614 -22.20410
## Z_CAPIRCOR -191.1202 0.6697678 1.472643 -0.79180158 -9.913188 -60.52610
## SEDIMENTRock SEDIMENTRubble SEDIMENTSand SEDIMENTShell
## S_ISOETIFO -52.32849 -87.37784 -52.54337 -51.50610
## T_HEMPRICH -21.68547 -20.77540 -22.06726 -22.15954
## Z_CAPIRCOR -12.86058 -85.57933 -10.79864 -11.44234
##
## Std. Errors:
## (Intercept) LATITUDE LONGITUDE DEPTH SEDIMENTMud
## S_ISOETIFO 0.0023636052 0.05375936 0.007463262 0.02633879 0.4056890
## T_HEMPRICH 0.0007594124 0.05381106 0.006821702 0.03759962 0.2780858
## Z_CAPIRCOR 0.0013655214 0.02863418 0.004343490 0.02906978 0.2949554
## SEDIMENTReef SEDIMENTRock SEDIMENTRubble SEDIMENTSand SEDIMENTShell
## S_ISOETIFO 9.114478e-01 0.7479946 3.716063e-14 0.3850719 0.9699091
## T_HEMPRICH 4.574640e-01 0.6054729 4.008084e-01 0.2488036 0.8517372
## Z_CAPIRCOR 1.547949e-21 0.5936871 1.011638e-32 0.2968752 0.7001492
##
## Residual Deviance: 4857.264
## AIC: 4911.264
# predicted probabilities
predicted_prob <- predict(multinom_fit, newdata=test_set, type="probs")
# predicted classes
predicted_class <- predict(multinom_fit, newdata=test_set, type="class")
# classification performance metrics
confusionMatrix(data = predicted_class, reference = test_set$SPECIES )
## Confusion Matrix and Statistics
##
## Reference
## Prediction C_SERRULAT S_ISOETIFO T_HEMPRICH Z_CAPIRCOR
## C_SERRULAT 140 14 7 26
## S_ISOETIFO 1 0 0 0
## T_HEMPRICH 11 2 187 17
## Z_CAPIRCOR 162 9 11 2507
##
## Overall Statistics
##
## Accuracy : 0.916
## 95% CI : (0.9056, 0.9255)
## No Information Rate : 0.8242
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6921
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: C_SERRULAT Class: S_ISOETIFO Class: T_HEMPRICH
## Sensitivity 0.44586 0.0000000 0.91220
## Specificity 0.98309 0.9996742 0.98962
## Pos Pred Value 0.74866 0.0000000 0.86175
## Neg Pred Value 0.94014 0.9919172 0.99374
## Prevalence 0.10149 0.0080802 0.06626
## Detection Rate 0.04525 0.0000000 0.06044
## Detection Prevalence 0.06044 0.0003232 0.07014
## Balanced Accuracy 0.71448 0.4998371 0.95091
## Class: Z_CAPIRCOR
## Sensitivity 0.9831
## Specificity 0.6654
## Pos Pred Value 0.9323
## Neg Pred Value 0.8938
## Prevalence 0.8242
## Detection Rate 0.8103
## Detection Prevalence 0.8691
## Balanced Accuracy 0.8243
We see that our multinomial logistic model has about 91% overall accuracy, which is worse performance than the random forest. The model seems to predict T_HEMPRICH the best with 88.7% sensitivity and 98.8% specificity. It peforms well for Z_CAPRICOR as well, but performs relatively poorly for C_SERRULAT and even worse for S_ISOETIFO with 0% sensitivity. Again, we can assess our results visually:
# true values
plot_ly(test_set, x=~LATITUDE, y=~LONGITUDE, z=~-DEPTH, color=~SPECIES, type="scatter3d", mode="markers")
# predicted values
plot_ly(test_set, x=~LATITUDE, y=~LONGITUDE, z=~-DEPTH, color=~predicted_class, type="scatter3d", mode="markers")
In this case, the truth and prediction plots look different, particularly in that the prediction plot does not appear to show any instances of S_ISOETIFO. This makes sense, given the small sample size and low variability in the values of the S_ISOETIFO variable.
Below we see that across sediment types, the species predictions (rf_pred) appear fairly similar to the true species distribution (SPECIES). This is not true across seabed types, in the two bar graphs further below - there we see that S_ISOETIFO is never predicted, and C_SERRULAT is predicted less often than in the truth.
test_set %>%
ggplot(aes(SEDIMENT, fill=SPECIES)) +
geom_bar(width=.5, position = "dodge")
test_set %>%
ggplot(aes(SEDIMENT, fill=predicted_class)) +
geom_bar(width=.5, position = "dodge")
test_set %>%
ggplot(aes(TIDAL, fill=SPECIES)) +
geom_bar(width=.5, position = "dodge")
test_set %>%
ggplot(aes(TIDAL, fill=predicted_class)) +
geom_bar(width=.5, position = "dodge")
The overall accuracy for the multinomial logistic regression model was 90.9%, and that of the random forest model was 95.6%. While these may seem fairly close, the accuracy for the multinomial logistic model is deceiving since it performed particularly poorly in terms of sensitivity in 2 out of 4 classes.